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Abstract 

We introduce a scalable algorithm, MUCCA for multiclass node clas- 
sification in weighted graphs. Unlike previously proposed methods for 
the same task, MUCCA works in time linear in the number of nodes. 
Our approach is based on a game-theoretic formulation of the problem 
in which the test labels are expressed as a Nash Equilibrium of a certain 
game. However, in order to achieve scalability, we find the equilibrium 
on a spanning tree of the original graph. Experiments on real-world data 
reveal that MUCCA is much faster than its competitors while achieving 
a similar predictive performance. 



1 Introduction 



Classification of networked data is a quite attractive field with applications in 
computer vision, bioinformatics, spam detection and text categorization. In 
recent years networked data have become widespread due to the increasing 
importance of social networks and other web-related applications. This growing 
interest is pushing researchers to find scalable algorithms for important practical 
applications of these problems. 

In this paper we focus our attention on a task called node classification, often 
studied in the semi-supervised setting Zhu et al. 2003 . Recently, different teams 



studied the problem from a theoretic point of view with interesting results. For 



example Cesa-Bianchi et al. 2009 2010 2011 developed on-line fast predictors 
for weighted and unweighted graphs and Herbster et al. developed different 



versions of the Perceptron algorithm to classify the nodes of a graph ( Herbster 



et al. 2009 , Herbster and Pontil 2006 ). 



Erdem and Pelillo 2011 introduced a game-theoretic framework for node clas- 



sification. We adopt the same approach and, in particular, we obtain a scalable 
algorithm by finding a Nash Equilibrium on a special instance of their game. 
The main difference between our algorithm and theirs is the high scalability 
achieved by our approach. This is really important in practice, since it makes 
possible to use our algorithm on large scale problems. 



2 Basic Framework 



Given a weighted graph G — {V, E, W), a labehng of G is an assignment y = 
(yi,-,yn) e {0,l,...,c}" where n= \V\. 

We expect our graph to respect a notion of regularity where adjacent nodes often 
have the same labeh this notion of regularity is called homophily. Most machine 



learning algorithms for node classification (Herbster et al. 12009 , Cesa-Bianchi 



et al. 


2010 , 


Zhu et al. 


2003 , 



Cesa-Bianchi et al. 2011| ) adopt this bias and 



exploit it to improve their performances. 
The learner is given the graph G, but just a subset of y, that we call training 
set. The learner's goal is to predict the remaining labels minimizing the number 
of mistakes. 

introduce also an irregularity measure of the graph 



Cesa-Bianchi et al. 2010 



G, for the labeling y, defined as the ratio between the sum of the weights of 
the edges between nodes with different labels and the sum of all the weights. 
Intuitively, we can view the weight of an edge as a similarity measure between 
two nodes, we expect highly similar nodes to have the same label and edges 
between nodes with different labels being "light". Based on this intuition, we 
may assign labels to non-training nodes so to minimize some function of the 
induced weighted cut. 

In the binary classification case, algorithms based on min-cut have been pro- 
posed in the past (for example Blum and Chawla 2001 ). Generalizing this 
approach to the multiclass case, naturally takes us to the multi-way cut (or 
multi-terminal cut — see Costa et al. 2005| ) problem. Given a graph and a 
list of terminal nodes, find a set of edges such that, once removed, each termi- 
nal belongs to a different component. The goal is to minimize the sum of the 
weights of the removed edges. 

Unfortunately, the multi-way cut problem is MAX SNP-hard when the number 

1994] ). Further 



of terminals is bigger than two ( |Dalhaus et al. 



algorithms to find the multi-way cut on special instances of the problem are 
known, but, for example, it is not clear if it is possible to reduce a node classi- 
fication problem on a tree to a multi-way cut on a tree. 



3 Graph Transduction Game 



In this section we describe the game introduced by Erdem and Pelillo 2011 that, 
in a certain sense, aims at distributing over the nodes the cost of approximating 
the multi-way cut. This is done by expressing the labels assignment as a Nash 
Equilibrium. We have to keep in mind that, since this game is non-cooperative, 
each player maximizes its own payoff disregarding what it can do to maximize 
the sum of utilities of all the players (the so-called social welfare). The value 
of the multi-way cut is strongly related to the value of the social welfare of the 
game, but in the general case a Nash Equilibrium does not give any guarantee 
about the collective result. 

In the Graph Transduction Game (later called GTG), the graph topology is 
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known in advance and wc consider each node as a player. Each possible label 
of the nodes is a pure strategy of the players. Since we are working in a batch 
setting, we will have a train/test split that induces two different kind of players: 

• Determined players(/D) those are nodes with a known label (train set), 
so in our game they will be players with a fixed strategy (they do not 
change their strategy since we can not change the labels given as training 
set) 

• Undetermined players(/j/) those that do not have a fixed strategy and 
can choose whatever strategy they prefer (we have to predict their labels) 

The game is defined as F = (7, S, tt), where I = {1, 2, n} is the set of players, 

S = X-ieiSi is the joint strategy space (the Cartesian product of all strategy 
sets Si C {1,2, ...c}), and tt : 5 — )■ K" is the combined payoff function which 
assigns a real valued payoff 7rj(s) e M to each pure strategy profile s G S and 
player i I. 

A mixed strategy of player i e / is a probability distribution x over the set of 
the pure strategies of i. Each pure strategy k corresponds to a mixed strategy 
where all the strategies but the k-th one have probability equals to zero. 
We define the utility function of the player i as 



where x{s) is the probability of s. 

We assume the payoff associated to each player is additively separable (this will 
be clear in the following lines). This makes GTG a member of a subclass of 
the multi-player games called polymatrix games. For a pure strategy profile 
s = (si, S2, ■■■Sn) € S, the payoff function of every player i G I is: 



where i ^ j means that i and j are neighbors, this can be written in matrix 
form as 



where Aij e W^'' is the partial payoff matrix between i and j, defined as 
A'ij — Ic ^ Wij, where Ic is the identity matrix of size c and Aij{x, y) represent 
the element of at row x and column y. The utility function of each player 
i G lu can be re-written as follows: 






Ui{x) 
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where k is an action selected from the player's set and in case i is a determined 
node with training label k, x's components will be always zeros except the fc-th 
corresponding to the pure strategy k. Since the utility function of each player 
is linear, it is easy to see that players can achieve their maximum payoff using 
pure strategies. 

In a non-cooperative game, a vector of strategies Sne is said to be a (pure 
strategies) Nash Equilibrium, if Vi G /, Vs^ G Si : s'^ ^ Si £ Sne, we have that 

Ui{si, Sj^j^) > Ui{si, Sjf^j^) 

where Ui{si, S^^) is the strategy configuration S except the i-th one, that is re- 
placed by Si- In practice, no player i will change its strategy Si to an alternative 
strategy improving its payoff. 

There are no guarantees that the Nash Equilibrium exists in pure strategies, 
but any game with a finite set of players and finite set of strategies has a Nash 
Equilibrium in mixed strategies (Nash 1951 , also see Nisan et al. 2008 ). In this 



case each player does not have to choose a strategy but it mixes its choices over 
its strategies. Instead of maximizing its payoff, it will maximize its expected 
payoff. 

Abusing of terminology, in the following sections we may talk about labels or 
pure strategies with the same meaning. 



3.1 The Evolutionary Stable Strategies approach 



Erdem and Pelillo 2011 propose to find a Nash equilibrium of the GTG using 
the Evolutionary Stable Strategies. We briefly present their approach in order 
for the reader to better understand the difference between their algorithm and 
ours. 



The evolutionary stable strategies (ESS) approach is well known ( Weibull 1995 
in the game-theoretic literature. It considers a game played repeatedly; each 
repetition of the game is seen as a generation, where an imaginary popula- 
tion evolve through a selection mechanism that, at each step, gives to the best 
"choices" a growing portion of the total population. 

The algorithm (later called GTG-ESS), at each generation, updates the proba- 
bility associated to every action h of every player i as 



Xih{t + 1) = Xth{t) 



Ujjeh) 

Ui{x{t)) 



The previous formula is just the discrete version of the so-called multi-population 
replicator dynamic: 

iih = Xih{ui{eh,x^i) - Ui{x)) 

where e/j is a vector of zeros except the /i-th component that is one and Xih 
is the h-th strategy of player i. The fixed points of the previous equations are 
Nash Equilibria, and the discrete version has the same properties — for further 



details see Erdem and Pelillo 2011 
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Figure 1: Some of the elements introduced in this paragraph: grey nodes are 
labeled nodes, the node with two circles is a fork, the fat black edges are those 
flagged "black line" , the thin solid black line edges are edges of the spanning 
tree and the dotted edges are edges of the original graph not selected for the 
spanning tree. 



In this case, the computational cost of finding the Nash Equilibrium is 0(fc|Fp) 
where k is the number of iterations and considering the number of classes as a 
constant factor. Erdem and Pelillo 2011 experimentally found that the number 
of iterations grows linearly with the number of nodes, so they consider the 
running time close to 0(|yp), but they do not seem to have any upper bound 
on the number of iterations. 



4 The MUCCA algorithm 



Use GTG-ESS for large scale networks can not be considered a viable solution, 
even if the time complexity of the algorithm were to be demonstrated in the order 
of 6(|T/p). A possible alternative is to apply some known results about regret 

in 



minimization, such as those described by Cesa-Bianchi and Lugosi 2006 
order to converge to a weaker notion of equilibrium, for example the Correlated 
Equilibrium (Aumann 1974| ). Unfortunately, the results of our experiments 



with the Correlated Equilibrium were not satisfactory. 
In this section we present MUCCA: a multiclass Classification Algorithm. Our 
algorithm consists in finding a Nash Equilibrium of the Graph Transduction 
Game on a special graph: a tree. We will show that in this way we achieve 
both good accuracy and scalability. In the remaining part of this section we 
will assume that the graph G is a tree. 

Now, we briefly introduce few notions those will be useful later in this section 
(see Figure [I]): 

• Revealed node: a node whose label is known 

• Unrevealed node: a node whose label is unknown 

• Fork: an unrevealed node that is connected to at least 3 revealed nodes 
by edge-disjoint paths 
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• Hinge node: a revealed node or a fork 

• Hinge tree: component of the forest created by removing from G all the 

edges incident to hinge nodes 

• Native hinge tree: component of the forest created by removing from G 
all the edges incident to revealed nodes. Its connection nodes are intended 
to be only labeled nodes. 

• Hinge line: a path connecting two hinge nodes such that there are no 
internal nodes those are hinge nodes 

• Connection Node: an hinge node that is adjacent to a node in an hinge 
tree 

• 6-edge: given Vi j, the path between i and j, an e-edge is Cij G argmineep^^ Wg, 

where Wc is tlie weiglit of the edge e. 

• Grafted tree: a tree without hinge nodes that is connected to just one 
node on an hinge line 

MUCCA works in four phases: 

1. Mark all the paths between revealed nodes and find all the fork nodes 

2. Estimate the label of each fork node 

3. Assign a label to all the nodes on the hinge lines using a min cut technique 

4. Assign a label to all the remaining nodes 

Now we describe what it does in every phase: 
Phase 1 

Starting from each revealed node, it does a breadth-first search until another 
revealed node or a leaf are found. Then if a revealed node was found, during 
the backtracking phase, MUCCA marks the edges on the path to the starting 
node with a special flag that we will call "black line". After that, each node 
with more than 3 edges marked as "black line" incident to it, is a fork. 
Phase 2 

Given a native hinge tree T-L that contains the fork J^, we can categorize its 
connection nodes in c categories using their labels. For each path between T 
and each connection node of H, we have an e-edge as defined before. The label 
assigned to is the same as the category (of the connection nodes) that has 
the maximum sum of weights over the distinct e-edges on the path between T 
and the nodes of that category. 
Phase 3 

On every hinge line, we label the nodes using min cut: in case the hinge nodes 
at the beginning and at the end of the line has the same label, all the nodes on 
the hinge line will be labeled with that label. Otherwise, all the nodes before 
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the e-edge arc assigned with the label of the node at the beginning of the line, 
and the others with the label of the node at the end of the line. In case we 
have more than one edge with the same weight of the e-edge (for example all 
the edges have the same weight), we use nearest neighbor to choose the labeling 
on the line. 
Phase 4 

All the remaining nodes arc in grafted trees. In this case, we assign the label 
of the node on the hinge line (connected to the tree) to all the nodes in that 
particular grafted tree. 

We now prove that MUCCA finds a Nash Equilibrium for this special case 
of the GTG. 

Theorem. The labeling found by MU CCA is a Nash Equilibrium of the Graph 
Transduction Game when the graph is an undirected tree. 

Proof. As we explained in Section 3, a profile of strategies Sne is a Nash Equi- 
librium if no one has incentive to deviate from its strategy. This means that 
Vi e /, Wi(si,S'^^) > Ui{s\,Sjf\). For the purpose of contradiction suppose 
that exists a node j such that it can improve its payoff by changing its strategy. 
j can not be contained in a grafted tree (those labeled in phase 4) since all the 
nodes contained in those trees have the same labels, so, whatever is the label, 
each of them gets a payoff of X^j^^ Wij, the maximum possible payoff. 
j can not be on an hinge line since they are labeled using min cut, so in the best 
case the payoff of each node is already the maximum payoff; in the worst case 
the payoff is the maximum minus the weight of the e-edge. Since the e-edge, by 
definition, has the minimum weight, there are no chance to improve the payoff. 
j can not be a revealed node (for obvious reasons). 

j can not be a fork. Since we use min cut to label the hinge lines if the e-edge of 
an hinge line is not incident to the fork, the node adjacent to the fork on that 
hinge line will have the same label of the fork. In this way the fork will get the 
part of payoff given by the edge between it and the adjacent node. Even if the 
e-edges are incident to the fork, the label prediction can not achieve a payoff 
better than the one achieved by the majority vote. 

Since j can not be a revealed node, nor a fork, nor a node on an hinge line, nor 
a node on a tree with just a connection node, it can not be in G. □ 

Since we consider the number of forks as a constant factor, MUCCA runs in 
0(|F|). Note that the labels of the unlabeled nodes of every native hinge tree 
can be predicted just using information about that singular native hinge tree. 
In this way, once the tree is splitted in native hinge trees, the predictions for 
the labels contained in every subtree are independent from the other subtrees. 
So, they can be computed using different threads, processes or even machines 
and we just need to get back a list of points (node id, predicted label). 
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5 Experiments 



In this section we present our experimental results. Since binary classification 
is a (well-studied) special case of our problem, we first compare MUCCA with 
the state of the art in this particular field. In the second part of this section we 
test our algorithm versus its competitors on multiclass problems. We have to 
keep in mind that MU CCA works in linear time and can be used on large-scale 
graphs, where we can not test its competitors. 

Our experimental protocol is quite simple: for every size of the training set 
(and possibly on every class), we did 10 runs for each algorithm. Algorithms 
working on trees were ran on: Minimum Spanning Trees since previous experi- 
mental works showed that predictors get their best results on this kind of tree 
( Cesa-Bianchi et al. 2011 ), and on Random Spanning Trees (RandomSpan- 



ningTrees) generated as in |Wilson| [1996 to test the most scalable solution pos- 
sible. We tested MUCCA also in a committee version: in the tables numbers 
before the predictor's name represent the number of predictors in the committee 
(for example n*MyPred will represents a committee of n predictors MyPred). 
Each predictor in the committee predicts its own labels using its own tree and 
then we aggregate the predictions with a majority vote over the committee. 
In order to better understand the competitiveness of AIU CCA, we will compare 
it with some well-know algorithms besides GTG-ESS. 



Label Propagation (abbreviated LABPROP), introduced by Zhu et al. 2003 
is a popular algorithm for node classification and one of the most accurate 
algorithms in the literature. LABPROP computes a real-valued function / : 
y — M on G, and then assigns a labeling to G using the values of /. The 
algorithm minimizes the following quadratic energy function: 



E{f) = lJ2w^jif{i)-m)' 

with constrains on the labeled nodes. Its running time is 0(|i?| x \V\). 



SHAZOO is a quite new algorithm introduced by Cesa-Bianchi et al. 2011 
and, at the best of our knowledge, it is the most accurate scalable algorithms 
between our competitors. SHAZOO has strong theoretical foundations, but 
the idea behind the algorithm is really different from our one: it is an online 
algorithm based on nearest neighbor. In Figure 2, you can see a toy example 
where SHAZOO and MUCCA behave in a different way: MUCCA predicts 
all the nodes on the left hand side as positive since there are two nodes labeled 
as positive, connected to a fork that is labeled as positive and the e-edge is the 
one with weight 1. Clearly every player wants to maximize its payoff, so the 
algorithm produces a labeling with a minimum possible cut. In the second part, 
for SHAZOO the node in the middle looks closer to the negative example than 
to the positive ones. In this way also all the nodes contained into the grafted 
tree will be labeled in a different way. 
SHAZOO's running time is 0{\V\). 

Weighted Majority Vote (later called WMV) predicts the label of a node i 
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Figure 2: Red labels are predicted respectively by MU CCA in the first tree and 
SHAZOO in the second one 



using a majority vote on the labeled neighbors weighted on the edges connecting 
them to i. Its running time is 8(|i?|). 

Both LABPROP and GTG-ESS can be used for multiclass datasets but it is 
not clear if it is possible to modify SHAZOO in order to get a multiclass al- 
gorithm. In the multiclass section we replace SHAZOO with WMV. We did 
not include in our comparison WTA ( Cesa-Bianchi et al. 2010|) since it was 
always outperformed by SHAZOO. Graph Perceptron (Herbster et al. [2009 ) 
was omitted since it requires a lot of computational resource and performed 



poorly in previous comparisons (for example in |Cesa-Bianchi et al. [2010] ) 



5.1 Binary classification 

In order to create a fair comparison, for the experimental activity we will gen- 



erate our datasets in the same way of Cesa-Bianchi et al. 2011 



5.1.1 Datasets 

We choose two real-world and well-known datasets to generate our graphs: 
USPS and RCVl. 

USPS is a set of hand written digits collected by the United States Postal Ser- 
vice, it contains 9298 images of 16 x 16 pixels (gray scale); the dataset called 
RCVl is a subset of 10000 articles in chronological order taken from Reuters 
Corpus (a huge collection of news released by ). All the articles have been pre- 
processed using TF-IDF. Both datasets were natively multiclass, so we tested 
our binary predictors using a standard one-vs-rest schema. We have 10 binary 
experiments for USPS and 4 binary experiments for RCVl. 
We generated our graphs with as many nodes as the total number of examples 
for each dataset, keeping for each node only 10 nearest neighbor (before sym- 
metrization) using the Euclidean distance \\xi — Xj\\. Edges' weights have been 
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calculated as 

where a is the average between the weights of all the edges incident to i or j. 
5.1.2 Experimental results 
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Table 1: RCVl and USPS - Error rate of each algorithm. Results are averaged 
over the 4 and 10 binary problems. 

The results, shown in Table [T] are not conclusive, but we can observe some 
interesting trends: 

• GTG-ESS is the most accurate algorithm, but its computational com- 
plexity makes it not particularly suitable for most of the current practical 
applications. 

• MUCCA's accuracy is good and it is always close or better that SHA- 
ZOO's one. 

• LABPROP approaches the competitors as the training set size grows. 
5.2 Multiclass classification 

In the multiclass comparison we replaced RCVl with two other datasets, since it 
is not just multiclass but also multi-label (an element has more than one label). 
As we explained before, SHAZOO works just on binary problems, so the only 
scalable competitor in this section is WMV. 

5.2.1 Datasets 

• USPS is the same graph we used for the binary classification. 

• CARDIO consists of measurements of fetal heart rate (FHR) and uterine 
contraction (UC) features on cardiotocograms classified by expert obste- 
tricians. We generated a graph starting from the feature vectors, in the 
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same way we did for USPS and RCVl. The graph is constituted by 2126 
nodes and 13696 edges. The nodes are divided in 3 classes (fetal states IM 



• GHGRAPH is a graph created from the data released by Github.com for 
their contest in 2009. Every node represents a repository and an edge 
means that there is a developer working on both repositories connected 
by that edge. In case more than one developer works on those repositories, 
the number of developers is used as a weight for that edge. The label of 
each repository is the most used programming language to write the code 
in that repository. We kept only the biggest connected component of the 
graph that includes 99907 nodes (64885 of them are labeled) and 11044757 
edges. Each programming language is a different class and at the end we 
had 40 classes. 

5.2.2 Experimental results 
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Table 2: Multiclass experiments - Error rate of each algorithm in function of 
the train size. Please note that we did not run GTG-ESS and LABPROP on 
GHGRAPH due to limited computational resources. 

The results of our experiments, shown in Table [2] are not conclusive, but we 
can observe some interesting trends: 

• It is not really clear which one between GTG-ESS and LABPROP is the 
most accurate algorithm, but anyway MU CCA is always competitive with 
them. 

• MUCCA is always much better than WMV. As expected WMV works 
better on "not too sparse" graphs such GHGRAPH, but even in this case 
it is outperformed by MU CCA. 

• GTG-ESS and LABPROP 's time complexity did not permit us to run 
them in a reasonable amount of time with our computational resources. 

6 Conclusions and future work 

We introduced a novel scalable algorithm for multiclass node classification in 
arbitrary weighted graphs. Our algorithm is motivated within a game theoretic 

^The Cardiotocography datasct was created by |de Campos et al.| [2000| 
and now is freely available on the UCI website with a complete description: 
|http://archive.ics.uci.edu/ml/datasets/Cardiotocography] 
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framework, where test labels are expressed as the Nash equilibriiim of a cer- 
tain game. In practice, MUCCA works well even on binary problems against 
competitors like Label Propagation and SHAZOO that have been specifically 

designed for the binary setting. 

Several questions remain open. For example, committees of MUCCA predictors 
work well but we do not know whether there are better ways to aggregate their 
predictions. Also, given their common game-theoretic background, it would 
be interesting to explore possible connections between committees of MUCCA 
predictors and GTG-ESS. 
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